Evidence is given for a systematic text-length dependence of the power-law index $\gamma$ of a single book. The estimated $\gamma$ values are consistent with a monotonic decrease from 2 to 1 with increasing text length. A direct connection to an extended Heaps' law is explored. As a consequence, the infinite-book limit is proposed to be given by $\gamma = 1$ instead of the value $\gamma = 2$ expected if Zipf's law were ubiquitously applicable. In addition, we explore the idea that the systematic text-length dependence can be described by a meta book concept, which is an abstract representation reflecting the word-frequency structure of a text. According to this concept, the word-frequency distribution of a text of a certain length written by a single author has the same characteristics as a text of the same length pulled out from an imaginary complete infinite corpus written by the same author.



I. INTRODUCTION 

The development of spoken and written language is one of the major transitions in evolution [1]. It has given us the advantage of easily and efficiently transferring information between individuals and even between generations. It could be argued that it is clear why language evolved in general, but it is harder to explain the reason for its structure. The structure of language has been studied since as early as the Iron Age in India and is still, to this day, a popular subject.

The field received a boost when George Kingsley Zipf, around 75 years ago, found an empirical law (Zipf's law) describing a seemingly universal property of written language. It states that the number of occurrences of a word in a long enough written text falls off as $1/r$, where $r$ is the occurrence rank of the word (the smaller the rank, the more occurrences) [3]. This in turn means that the normalized word-frequency distribution (wfd) follows the expression $P(k) \propto 1/k^{2}$, where $P(k)$ is the probability of finding a word that appears $k$ times in a text. This empirical law is generally believed to represent some ubiquitous nature of the wfd, and has inspired the development of several models reproducing this structure [7]. However, empirically one typically finds that the wfd follows a power-law distribution with an exponent smaller than 2. It was also reported in Ref. [10] that the exponent (commonly denoted $\gamma$) for a power-law description of the wfd seems to change with the length of a text, rather than being constant.

Another property is the number of different (unique) words, $N$, as a function of the total number of words in a book, $M$. (In this context a book is a sequence of words, where words are defined as collections of letters separated by spaces.) The conventional way of describing this relation is Heaps' law [11], which states that $N \propto M^{\alpha}$, where $0 < \alpha < 1$ is a constant.

In this paper we present, and give evidence for, a meta book concept, which is an abstract picture of how an author writes a text. We suggest a systematic text-length dependence for the wfd which is directly connected to an extended Heaps' law with an $\alpha$ changing from 1 to 0 as the text length is increased from $M = 1$ to infinity.



II. THE META BOOK CONCEPT 

We start by studying the above-mentioned property, $N(M)$. Figure 1 shows this curve for three different authors (Hardy, Melville and Lawrence). We have created very large books by attaching novels together, in order to extend the range of book sizes (see Appendix A for a full list of books). The curve shows a decreasing rate of adding new words, which means that $N$ grows slower than linearly ($\alpha < 1$). Also, for a real book, $N(M = 1) = 1$, which means that the proportionality constant in Heaps' law must be one. So, if $N = M^{\alpha}$ then $\alpha = \ln N / \ln M$. This quantity is plotted in the inset of Fig. 1, and the data show that $\alpha$ decreases as a function of the size, ruling out the possibility of accurately describing the $N(M)$-curve using a constant $\alpha$. A plausible scenario would be that $\alpha$ continues to decrease asymptotically towards zero as $M$ reaches infinity. This would mean that the $N(M)$-curve saturates and $N(M \to \infty)$ is finite.
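
The curve $N(M)$ and the effective exponent $\alpha = \ln N / \ln M$ can be obtained in a single pass over a tokenized text. The following is a minimal sketch and not the code used for the figures; the function name `heaps_curve` and the toy sentence are illustrative assumptions, and words are taken to be whitespace-separated tokens as in the definition given in the Introduction.

```python
import math

def heaps_curve(words, checkpoints):
    """Return (M, N, alpha) at each requested text length M.

    words       : list of tokens in reading order
    checkpoints : text lengths M with 1 <= M <= len(words)
    alpha is ln N / ln M; at M = 1 it is set to 1 (first word is always new).
    """
    wanted = sorted(set(checkpoints))
    seen = set()          # distinct words met so far
    out = []
    idx = 0
    for m, w in enumerate(words, start=1):
        seen.add(w)
        if idx < len(wanted) and m == wanted[idx]:
            n = len(seen)
            alpha = math.log(n) / math.log(m) if m > 1 else 1.0
            out.append((m, n, alpha))
            idx += 1
    return out

# Toy example (hypothetical text, only to show the call).
text = "the cat saw the dog and the dog saw the cat"
curve = heaps_curve(text.split(), [1, 4, 11])
```

Applied to a concatenated collection of novels, the third entry of each tuple gives the quantity plotted in the inset of Fig. 1.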

When the length of a text is increased, the number of different words also increases. However, the average usage of a specific word is not constant, but increases as well. That is, we tend to repeat words more when writing a longer text. One might argue that this is because we have a limited vocabulary, so that when writing more words the probability of repeating an old word increases. But, at the same time, a contradictory argument could be that the scenery and plot, described for example in a novel, are often broader in a longer text, leading to a wider use of one's vocabulary. There is probably some truth in both statements, but the empirical data seem to suggest that the dependence of $N$ on $M$ reflects a more general property of an author's language.

For every size of a text, the average occurrence of a word can be calculated as $\langle k \rangle = M/N$. This means that the $N(M)$-curve can be converted into a curve for the average frequency as $M/N(M) = \langle k \rangle(M)$. This curve is shown in Fig. 2a-c for the three different authors. Each point represents a real book or a collection of books, and the curves represent the $\langle k \rangle(M)$-curve for the full collection of books for each author (i.e. the same data as in Fig. 1). The data are plotted as $1/\langle k \rangle$ as a function of $1/M$ in order to get a feeling for the asymptotic behavior as $M$ reaches infinity.

FIG. 1: The number of different words, $N$, as a function of the total number of words, $M$, for the authors Hardy, Melville and Lawrence. The data represent a collection of books by each author. The inset shows the exponent $\alpha = \ln N / \ln M$ as a function of $M$ for each author.

The overlap between the line and the points means that the average frequency of a word (and consequently also $N$) in a short story is to a good approximation the same as for a section of equal length from a larger text written by the same author. Note that the texts have to be written by the same author, since the overlap would not be nearly as good if books by Lawrence were compared to the curve for Melville.

In Fig. 2d-f, we literally pull out sections from a very large book and compare the result to a much smaller book (with a size difference of a factor $n$). The figures show the wfd for an $n$th part (averaged over 200 sections) of the full collection and for a short story by the same author. The distribution for the full collection is also included for comparison. The overlap between the short story and the section of the big book implies that the wfd of a text can be recreated by taking a section of a larger book written by the same author. It does not matter whether we pull out half of a book of size $M$ or a fourth of a book of size $2M$.
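
The sectioning procedure can be sketched as follows: sections of a fixed size are cut from a long word list at several starting points, wrapping around the end of the text, and the wfd is averaged over the sections. This is only an illustration under stated assumptions; the function name `section_wfd`, the evenly spaced starting points and the choice to normalize each section before averaging are ours and are not taken from the analysis above.

```python
from collections import Counter

def section_wfd(words, size, n_starts):
    """Average word-frequency distribution over sections of `size` words.

    Sections start at n_starts evenly spaced positions and wrap around the
    end of the text (periodic boundary conditions).
    Returns {k: P(k)}, where P(k) is the average fraction of distinct
    words in a section that occur exactly k times.
    """
    total = len(words)
    acc = Counter()
    for s in range(n_starts):
        start = (s * total) // n_starts
        section = [words[(start + i) % total] for i in range(size)]
        counts = Counter(section)                 # word -> occurrences
        n_unique = len(counts)
        # Counter(counts.values()) maps k -> number of words occurring k times
        for k, n_k in Counter(counts.values()).items():
            acc[k] += n_k / n_unique / n_starts
    return dict(acc)
```

Calling `section_wfd(words, len(words) // n, 200)` would then give the kind of $n$th-part average that is compared with a short story of the same size.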

These findings lead us towards the meta book concept: the writing of a text can be described by a process where the author pulls a piece of text out of a large mother book (the meta book) and puts it down on paper. This meta book is an imaginary infinite book which gives a representation of the word-frequency characteristics of everything that a certain author could ever think of writing. This has nothing to do with semantics and the actual meaning of what is written, but rather concerns the extent of the vocabulary, the level and type of education and the personal preferences of an author. The fact that people have such different backgrounds, together with the seemingly different behavior of the function $N(M)$ for the different authors, invites the speculation that every person has their own unique meta book, in which case it can be seen as a fingerprint of an author.

FIG. 2: Evidence in favor of the meta book concept: (a)-(c) The average frequency of a word as a function of the size of the book ($M$), plotted as $1/\langle k \rangle$ versus $1/M$ for the three authors. The long-dashed, short-dashed and dotted curves correspond to the $N(M)$-curve, plotted as $N/M$ versus $1/M$, for the biggest collection of books by each respective author. The $\langle k \rangle$ for a small book is close to that for a section (of the same size) of the bigger book. (d)-(f) The word-frequency distribution for an $n$th part (triangles) of the full collection of books (filled circles) compared to a small book (open circles) of the same size as the $n$th part. The wfd is approximately the same for a small book as for a section (of the same size) of a big book.

Yet another, more obvious, property is the frequency of the most common word, $k_{\max}$. When dividing a book in half, $k_{\max}$ should also be cut in half. This linear relation between $k_{\max}$ and $M$ is shown in Fig. 3 to be in agreement with the real data, which is consistent with the meta book concept. This follows because the most common word is most likely a "filling word" (e.g. "the") which would be evenly distributed throughout the text (e.g. every twentieth word or so).

So far we have been sectioning down books into smaller sizes, according to the meta book concept. But what happens if we go in the other direction and extrapolate to larger sizes? What could the meta book look like? In the next section we obtain the size dependence of the parameter values of the wfd in terms of $\alpha$ and present the asymptotic limit of $\alpha = 0$.

FIG. 3: The frequency of the most common word, $k_{\max}$, as a function of the size of the book, $M$, for the three authors on a log-log scale.



III. SIZE DEPENDENCE OF THE WFD 

To find the size dependence of the wfd we notice that there is a simple relation between the wfd and $\langle k \rangle$. If $\langle k \rangle$ (which is directly related to the $N(M)$-curve, and thus to $\alpha$) changes with the size, the wfd also has to change in some way (e.g. a smaller cutoff or a changed slope). But we also know that the tail of the distribution must be regulated in such a way that the maximum frequency does not become unrealistically large (e.g. 90% of all the words being the same), but stays consistent with Fig. 3. Given a functional form, what kind of relation between the functional parameters is needed to balance these requirements? The requirements mentioned can be summarized in three basic assumptions supported by our previous analyses:

1 The number of different words, $N$, scales as the total number of words, $M$, to some power that can change with $M$: $N \propto M^{\alpha}$, where $\alpha = \alpha(M)$ can range between 1 and 0. This means that the average frequency scales like $\langle k \rangle \propto M^{1-\alpha}$.

2 The value $\tilde{k}_{\max}$, defined through the cumulative word-frequency distribution as $F(\tilde{k}_{\max}) = 1/N$, should increase linearly with the size of the book. That is, $\tilde{k}_{\max} = \epsilon M$, where $\epsilon$ is a constant larger than zero.

3 The word-frequency distribution of a book is to a good approximation of the form

$$P(k) = A \frac{e^{-bk}}{k^{\gamma}} \tag{1}$$

where $A$, $b$ and $\gamma$ may depend on $M$, so that $A = A(M)$, $b = b(M) = b_0 M^{-\beta}$ ($\beta > 0$) and $\gamma = \gamma(M)$.

The fact that $N$ can be expressed as $N \propto M^{\alpha(M)}$ is always true. The implicit assumption made is that $\alpha(M)$ is a slowly and monotonically decreasing function from $\alpha(1) = 1$ to $\lim_{M \to \infty} \alpha(M) = 0$. That a slowly varying $\alpha$ can describe $N$ is plausible since a fair approximation is usually obtained with just a constant $\alpha$ in the range $0 < \alpha < 1$ (Heaps' law). The limit $\alpha(1) = 1$ is just the observation that the first couple of words one writes in a book are usually different, and the limit $\alpha(M \to \infty) = 0$ is the extreme limit where the author's vocabulary has been used up, so that no new words are added and the increase of $N$ approaches zero.

The second assumption reflects the statement that if the most common word used by an author is "the" and one compares two texts by the same author, where one is twice as long as the other, then the longer text contains on average twice as many occurrences of "the" as the shorter one. This statement can be expressed in terms of the cumulative normalized wfd, which is defined as $F(k') = \sum_{k \geq k'} P(k)$. Thus $F(\tilde{k}_{\max}) = 1/N$ means that if a data set is created by drawing $N$ random numbers from a theoretical and continuous function, $P(k)$, then one would get, on average, one word appearing with a frequency larger than $\tilde{k}_{\max}$. This word, with frequency $k_{\max}$, would then become the most common word in the text. So, $\tilde{k}_{\max}$ is a theoretical limit, while $k_{\max}$ is the actual frequency of the most common word. Since the distribution $P(k)$ is a rapidly decreasing function for large $k$, the most common word always appears with a frequency very close to $\tilde{k}_{\max}$ ($k_{\max} \approx \tilde{k}_{\max}$). It follows that $\tilde{k}_{\max} \propto M$, which means that $\tilde{k}_{\max} = \epsilon M$ is a valid assumption to a good approximation.

The first two assumptions can be expressed in the continuum approximation as two integral equations:

$$\langle k \rangle_M = \int_{1}^{M} k\, P_M(k)\, dk \propto M^{1-\alpha(M)} \tag{2}$$

$$\frac{1}{N_M} = \int_{\epsilon M}^{M} P_M(k)\, dk \propto M^{-\alpha(M)} \tag{3}$$

The third assumption is based on the notion that this functional form fits well to empirical data. The basic assumption made in the present context is that the power law with an exponential cutoff gives the correct large-$k$ behavior and that $A$ and $\gamma$ vary slowly with $M$.
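
The scaling required by the first integral condition can be probed numerically for a trial choice of parameters. The sketch below evaluates $\langle k \rangle_M$ for the assumed form $P_M(k) = A e^{-bk}/k^{\gamma}$ with a constant $\gamma$ and $b = b_0/M$ (i.e. $\beta = 1$), and estimates the local exponent $d \ln \langle k \rangle / d \ln M$ by a finite difference. The function names, the grid size, the integration limits $1 \le k \le M$ and the values $\gamma = 1.7$, $b_0 = 25$ are illustrative assumptions, not fitted quantities.

```python
import math

def mean_frequency(M, gamma, b0, n_grid=20000):
    """<k>_M for P_M(k) = A exp(-b0 k / M) / k**gamma on 1 <= k <= M.

    Both integrals use the trapezoidal rule on a logarithmic grid;
    the normalisation constant A cancels in the ratio.
    """
    step = math.log(M) / (n_grid - 1)
    ks = [math.exp(i * step) for i in range(n_grid)]
    ws = [math.exp(-b0 * k / M) / k**gamma for k in ks]   # unnormalised P_M(k)
    num = den = 0.0
    for i in range(n_grid - 1):
        dk = ks[i + 1] - ks[i]
        num += 0.5 * (ks[i] * ws[i] + ks[i + 1] * ws[i + 1]) * dk
        den += 0.5 * (ws[i] + ws[i + 1]) * dk
    return num / den

def local_exponent(M, gamma, b0, factor=2.0):
    """Finite-difference estimate of d ln<k> / d ln M between M and factor*M."""
    lo = mean_frequency(M, gamma, b0)
    hi = mean_frequency(factor * M, gamma, b0)
    return math.log(hi / lo) / math.log(factor)

slope = local_exponent(1e9, gamma=1.7, b0=25.0)
```

For these trial values the estimated exponent comes out close to $2 - \gamma$, which is the number to be matched against $1 - \alpha$ in Eq. (2).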

Next, we explore the consequences of the three basic assumptions, but first the normalization condition is investigated. From Eq. (1) we get, for $M^{\beta} \gg b_0$,

$$\begin{cases} A \approx \gamma - 1 & \text{if } \gamma > 1 \\ A \propto M^{-\beta(1-\gamma)} & \text{if } \gamma < 1 \end{cases} \tag{4}$$

So, as long as $\gamma > 1$ and $M$ is sufficiently large, $A$ has no explicit $M$ dependence. That is, if $\gamma$ is constant or varies slowly enough, we can treat $A$ as constant. The next step is to evaluate Eq. (2) by inserting Eq. (1), which gives

$$\begin{cases} \langle k \rangle_M = \dfrac{\gamma - 1}{\gamma - 2} & \text{if } \gamma > 2 \\ \langle k \rangle_M \propto M^{\beta(2-\gamma)} & \text{if } \gamma < 2 \end{cases} \tag{5}$$

According to Eqs. (5) and (2), $\gamma > 2$ means that the average usage of a word is independent of the size of the book, so that $M/N = \text{const}$ and consequently $N \propto M$ ($\alpha = 1$). That is, the number of different words grows linearly with the size of the book. Solving for $\gamma$ in this case gives

$$\gamma = 1 + \frac{1}{1 - 1/\langle k \rangle}.$$

This is also the analytic solution for the Simon model, where a text grows linearly as $N = M/\langle k \rangle$ with preferential repetition. Here we instead arrive at this result from the assumed functional form, without introducing any type of growth or preferential element.

However, the crucial point is that if $\gamma < 2$, then $M^{1-\alpha} \propto M^{\beta(2-\gamma)}$ and $\alpha = 1 - \beta(2 - \gamma)$, or

$$\gamma = 2 - \frac{1}{\beta}(1 - \alpha). \tag{6}$$

Thus, we have a relationship between $\gamma$ and $\alpha$, so the power-law exponent is determined by the rate at which new words are introduced.

The second assumption (Eq. (3)), with $\gamma > 1$, gives for large $M$ the relation

$$\begin{cases} \dfrac{1}{N} \propto M^{1-\gamma} & \text{if } \beta \geq 1 \\ \dfrac{1}{N} \propto M^{\beta(1-\gamma)} & \text{if } \beta < 1 \end{cases} \tag{7}$$

The last case in Eq. (7) ($\beta < 1$) can be disregarded as impossible, since $\gamma$ would need to be smaller than one for the integral to be positive, which means that $\alpha$ would also be negative. This would give a book where the number of different words decreases as a function of the total number of words. However, the case $\beta \geq 1$ together with Eq. (3) gives the relation $1/N \propto M^{-\alpha} \propto M^{1-\gamma}$ and consequently $\alpha = \gamma - 1$, or

$$\gamma = 1 + \alpha. \tag{8}$$

Finally, substituting Eq. (8) into Eq. (6) locks down the value of $\beta$ to be one, and the wfd (given the previously assumed form) becomes

$$P_M(k) = A \frac{e^{-b_0 k / M}}{k^{1+\alpha(M)}} \tag{9}$$

for large $M$.

Note that if $\alpha$ goes to zero as $M$ goes to infinity, then $\gamma$ will move infinitely close to one, and this should be true for all authors. Nevertheless, different authors might reach this point in different ways. Taking the limit of $M$ going to infinity in Eq. (9) ($b_0/M \to 0$, $\alpha(M) \to 0$) then gives the functional form of the wfd for an infinite book:

$$P(k) = \frac{A}{k}. \tag{10}$$

In practice though, $b_0/M$ and $\alpha(M)$ will never be exactly zero.

So far, we have shown that the meta book concept is supported by empirical data. We have also derived an expression for the size dependence of the parameters of the wfd, given a functional form. These are in some sense two independent findings which are connected through the exponent $\alpha$. Next we show that the derived expression for the wfd (Eq. (9)) is consistent with the real data and that the process of pulling sections out of a large book recreates the observed size dependence of $\alpha$.



IV. SIZE DEPENDENCE IN REAL BOOKS

To validate the assumption that $\alpha$ approaches zero as $M$ increases, we need to fit the real data to an appropriate functional form. This functional form needs to satisfy two constraints: (i) $\alpha(M)$ should be a monotonically decreasing function with the asymptotic limit for large $M$ equal to zero; (ii) $N = M^{\alpha(M)}$ should be a monotonically increasing function (by definition the number of unique words never decreases). These constraints result in the condition

$$\alpha(M) \geq -M \ln M \, \frac{d}{dM}\alpha(M), \tag{11}$$

where the equality gives the solution $1/\alpha(M) = u \ln M$, where $u$ is an arbitrary constant. In order to parametrize $\alpha(M)$ we introduce an additional parameter, $v$, giving the expression $1/\alpha(M) = u \ln M + v$, which obeys the inequality in Eq. (11) if $v \geq 0$. The final parametrization to describe $\alpha$ is then

$$\alpha(M) = \frac{1}{u \ln M + v}. \tag{12}$$

The limiting value for $N$, given Eq. (12), is $\lim_{M \to \infty} N = \lim_{M \to \infty} M^{\alpha(M)} = e^{1/u}$. Note that this parametrization is a generalization of Heaps' law ($\alpha = \text{const}$ if $u = 0$). We obtain a good fit with this parametrization for all three authors, as shown in Fig. 4, where we ignore the first 2-10^ words since we are interested in the large-$M$ behavior. However, the resulting fit for $N(M) = M^{\alpha(M)}$ is very reasonable also for small $M$.

FIG. 4: The exponent $\alpha = \ln N / \ln M$ as a function of $M$ for each author together with the corresponding fits to Eq. (12). The fitting-parameter values $(u, v)$ are for Hardy (0.0420, 0.772), for Melville (0.0394, 0.777) and for Lawrence (0.0366, 0.849).
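
The two constraints on the parametrization can be checked directly by evaluating it over a range of sizes. The sketch below uses the $(u, v)$ values quoted for Hardy in the caption of Fig. 4; the function names and the chosen range of $M$ are illustrative assumptions.

```python
import math

def alpha(M, u, v):
    """alpha(M) = 1 / (u ln M + v), the parametrization of Eq. (12)."""
    return 1.0 / (u * math.log(M) + v)

def unique_words(M, u, v):
    """N(M) = M**alpha(M)."""
    return M ** alpha(M, u, v)

u, v = 0.0420, 0.772                  # fit values quoted for Hardy in Fig. 4
sizes = [10**e for e in range(4, 10)]
alphas = [alpha(M, u, v) for M in sizes]
counts = [unique_words(M, u, v) for M in sizes]
n_limit = math.exp(1.0 / u)           # limiting number of unique words, e^{1/u}
```

Over this range `alphas` decreases while `counts` increases and stays below `n_limit`, which is the behavior required by constraints (i) and (ii).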

The main point is not to get the exact extrapolation behavior for each author but to show that they are all in accordance with the suggested functional form of $\alpha(M)$, telling us that the empirical data are consistent with $\alpha$ going to zero.

The three assumptions in the previous section lead to the specific form of the wfd in terms of $\alpha(M)$ (Eq. (9)). In Fig. 5 this result is compared to the real data for two authors (columns) and, for each author, three different book sizes (rows). Since $A$ is a normalization constant and $\alpha(M) = \ln N / \ln M$, there is essentially only one free parameter, $b_0$. This parameter is a characteristic of the author and, according to the above analysis, is independent of the length of the text. In other words, once the author's characteristic $b_0$ is determined, the parameter $b$ for a text of length $M$ by the same author is given by $b = b_0/M$. The agreement suggests that the analysis leading to Eq. (9) is indeed valid.

FIG. 5: The wfd for three different books (rows) of different sizes, written by Hardy (a-c) and Melville (d-f), together with the function given by Eq. (1). The parameters are given by $b = b_0/M$ and $\gamma = 1 + \alpha(M)$ (according to Eq. (9)), where $b_0 = 25.5$ for Hardy and 23.8 for Melville.

The empirical data seem consistent with the size dependence derived for the wfd with $b = b_0/M$ and $\gamma = 1 + \alpha(M)$. But what is causing the peculiar form of $\alpha(M)$? Our suggestion is that the actual sectioning of a book is responsible for creating such a structure. This can be tested by applying the meta book concept to a large hypothetical book.

The actual process of pulling a section out of a book can be described analytically by a combinatorial transformation, provided one assumes that the words in a book are uniformly distributed [10]. For instance, if the word "the" exists $k'$ times in a book, then the probability of getting $k$ occurrences of "the" when taking half ($n = 2$) of that book is given by the binomial distribution. This can be generalized to any $n$ (Eq. (13)) and is called the Random Book Transformation (RBT). This transformation describes how the wfd changes when a section of size $M$ is pulled out of a bigger book of size $M'$:

$$P_M(k) = C \sum_{k' \geq k} A_{kk'} P_{M'}(k') \tag{13}$$

where $n = M'/M$, $C$ is the normalization constant and $A_{kk'}$ is the triangular matrix with the elements

$$A_{kk'} = \binom{k'}{k} \left(\frac{1}{n}\right)^{k} \left(1 - \frac{1}{n}\right)^{k'-k}. \tag{14}$$

To analyze the behavior of the RBT we start with the theoretical wfd for the full Hardy from Fig. 5a, $P_{M'}(k) = A \exp(-0.000019k)/k^{1.732}$, and transform it down to smaller sizes, calculating the average frequency for each size, $M$, according to the formula

$$\langle k \rangle_M = \sum_{k \geq 1} k\, P_M(k),$$

where $P_M(k)$ is given by Eq. (13).

In Fig. 6, $\langle k \rangle_M$ is plotted on a log-log scale for the data created by the RBT, as circles, and the full line represents the real data for the full Hardy (same data as the line in Fig. 2a). The dotted line shows the corresponding analytic result $\langle k \rangle = M^{2-\gamma} = M^{1-\alpha} = M^{0.268}$ ($\gamma = 1.732$), for a constant $\alpha$ and $\gamma$. The figure shows the similar behavior of the RBT and the real data.

FIG. 6: The average frequency of a word, $\langle k \rangle$, as a function of the total number of words, $M$. The line shows the real data of the full collection by Hardy and the circles show the result obtained from the RBT starting at the wfd for the full Hardy in Fig. 5a, i.e. $P_{M'}(k) = A \exp(-0.000019k)/k^{1.732}$. The dotted line corresponds to the analytic solution $\langle k \rangle = M^{2-\gamma} = M^{1-\alpha} = M^{0.268}$ for a constant $\gamma = 1.732$.
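
The Random Book Transformation described above can be sketched in a few lines if one assumes that each of the $k'$ occurrences of a word is kept independently with probability $1/n = M/M'$, so that the matrix elements are binomial probabilities. The function names and the dictionary representation of the wfd are illustrative assumptions. The direct use of `math.comb` is only suitable for small $k'$; for the frequencies found in real books the binomial weights would have to be evaluated in log space.

```python
from math import comb

def random_book_transform(p_big, n):
    """Transform the wfd of a book of size M' into that of a section of size M'/n.

    p_big : dict {k': P_{M'}(k')} with k' >= 1
    n     : size ratio M'/M (n >= 1)
    Returns {k: P_M(k)} for k >= 1, renormalised after dropping the words
    that do not appear in the section at all (k = 0).
    """
    q = 1.0 / n                                   # probability to keep one occurrence
    p_small = {}
    for k_big, weight in p_big.items():
        for k in range(1, k_big + 1):
            a = comb(k_big, k) * q**k * (1.0 - q) ** (k_big - k)   # binomial element
            p_small[k] = p_small.get(k, 0.0) + a * weight
    norm = sum(p_small.values())                  # plays the role of 1/C
    return {k: p / norm for k, p in sorted(p_small.items())}

def average_frequency(p):
    """<k> = sum over k of k P(k)."""
    return sum(k * pk for k, pk in p.items())

# A book in which every word occurs exactly twice, cut in half (n = 2).
half = random_book_transform({2: 1.0}, n=2)
```

In this toy case two thirds of the surviving words occur once and one third occur twice, so the average frequency drops from 2 to 4/3 rather than to 1; the words that vanish from the section are what makes $\langle k \rangle$ decrease more slowly than $M$.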



V. CONCLUSIONS 

In the present paper we have discussed the text-length dependence of the wfd of single authors. Evidence is given for a systematic decrease in the power-law index $\gamma$ of the wfd, from $\gamma \approx 2$ for short novels to the infinite-book-size limit with $\gamma = 1$. This systematic change is linked to the text-length dependence of the number of unique words $N$ as a function of the total number of words $M$.

We have shown empirically that the size dependence of the wfd (and also of $N$ and $\langle k \rangle$ as functions of $M$) displays a behavior very similar to that obtained by sectioning down a large book. It was also demonstrated, through the use of the RBT, that the same process can reproduce the observed decrease of $\alpha$. This has led us to introduce the concept of a meta book, which is an imaginary book of infinite length written by an author, as a description of this behavior. Furthermore, the meta book should have a wfd close to $P(k) = A/k$. The meta book should contain all the statistical properties of a real text, related to the specific writing style of an author, which are then transferred to the real book when it is pulled out of this meta book. It is important to remember that this is an abstract description, and novels (or text sections in novels) of length $M$ written by a single author are on average characterized by $P_M(k)$. One may also note that the meta book is a holistic concept, which implies that any text length written by the author carries information about the total extent of the author's vocabulary: the $P_M(k)$-average for a text section of size $M$ is independent of the total size $M'$ of the book.

It is interesting to compare with the related phenomenon of family-name distributions, where the $\gamma = 1$ limit is realizable [18]. In this case, $M$ corresponds to the number of inhabitants of a country or town, $N$ to the number of different family names, and $P(k)$ to the corresponding frequency distribution of family names. For a country like the USA or a town like Berlin, $P(k) \propto k^{-\gamma}$ with $\gamma \approx 2$. However, for Vietnam $\gamma \approx 1.4$ and for Korea $\gamma \approx 1$. This decrease of $\gamma$ is correlated with a corresponding decrease of $\alpha$ in $N \propto M^{\alpha}$. Thus the less the number of family names increases with the size of the population, the smaller $\gamma$ becomes, until the limiting case $\gamma = 1$ and $\alpha = 0$ is reached. For Korea the empirical finding is $N \propto \ln M$, which indeed corresponds to $\alpha = 0$. In fact, the relation between the exponents, $\gamma = 1 + \alpha$, was also obtained in earlier work for the case of family names, suggesting that the relation between $P_M(k)$ and $N(M)$ is more general than suggested here and could hold for different kinds of systems.
TABLE I: Collection of books used as data. The authors are Thomas Hardy (TH), Herman Melville (HM) and David Herbert Lawrence (DHL).

Author  Book title (abbr)                        M          N
TH      Greenwood Tree (GT)                   57,965      6,645
        The Well-Beloved (WB)                 63,288      6,985
        Two on a Tower (TT)                   94,849      8,875
        The Trumpet-Major (TM)               114,841      9,328
        A Pair of Blue Eyes (BE)             131,598     10,533
        The Woodlanders (W)                  137,184     10,566
        Far From the Madding Crowd (MC)      138,004     11,797
        Desperate Remedies (DR)              142,346     10,333
        The Hand of Ethelberta (HE)          142,894     10,694
        Return of the Native (RN)            142,931     10,437
        Jude the Obscure (JO)                146,557     10,896
        Tess of the d'Urbervilles (TU)       151,097     12,159
HM      I and My Chimney (IM)                 11,525      2,713
        Israel Potter (IP)                    65,545      9,234
        The Confidence-Man (CM)               94,644     10,595
        Typee (T)                            108,080     10,231
        Redburn. His First Voyage (RV)       119,696     11,535
        White Jacket (WJ)                    144,892     13,710
        Moby Dick (MD)                       212,473     17,226
DHL     The Prussian Officer (PO)              9,115      1,823
        Fantasia of the Unconscious (FU)      61,972      6,192
        Trespasser (T)                        71,506      6,986
        Aaron's Rod (AR)                     114,384      8,907
        The Lost Girl (LG)                   137,955     10,427
        Sons and Lovers (SL)                 162,101      9,606
        Women in Love (WL)                   182,722     11,301

Appendix A



The empirical data used in this article are books written by three authors: Thomas Hardy, Herman Melville and David Herbert Lawrence (see Table I for a complete list of books). All books are taken from the online book catalog "Project Gutenberg" (http://www.gutenberg.org/catalog/). In order to estimate the behavior of very large books we join together a collection of books for each author, by simply adding the books one after the other. Averages have been obtained by employing periodic boundary conditions and using different starting points in the book. This is a valid procedure since the words are, to a large extent, uniformly distributed throughout the book and, statistically speaking, there is no such thing as a beginning or an end [10]. This method gives a considerable reduction of statistical fluctuations.

When presenting the word-frequency distribution we use a log2 binning where the size of the bins follows the formula $S_i = 2^{i-1}$.
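
A log2 binning of this kind can be sketched as below. The function name `log2_binned`, the convention that bin $i$ covers $k = 2^{i-1}, \ldots, 2^{i} - 1$ and the division of the summed probability by the bin width are assumptions made for illustration; the text only specifies the bin sizes.

```python
def log2_binned(p):
    """Bin a wfd {k: P(k)} into bins of width S_i = 2**(i-1), i = 1, 2, ...

    Bin i is taken to cover k = 2**(i-1), ..., 2**i - 1.
    Returns a list of (k_low, k_high, density), where density is the
    summed P(k) in the bin divided by the bin width.
    """
    if not p:
        return []
    k_max = max(p)
    bins = []
    i = 1
    while 2 ** (i - 1) <= k_max:
        width = 2 ** (i - 1)
        k_low = width
        k_high = k_low + width - 1
        mass = sum(p.get(k, 0.0) for k in range(k_low, k_high + 1))
        bins.append((k_low, k_high, mass / width))
        i += 1
    return bins

# Small hypothetical distribution, only to show the call.
binned = log2_binned({1: 0.5, 2: 0.2, 3: 0.1, 4: 0.2})
```

Dividing by the bin width keeps a pure power law straight on a log-log plot, which is why this convention is a common choice for heavy-tailed distributions.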